
Conversation

@forsyth2 (Collaborator)

@forsyth2 forsyth2 commented Jul 11, 2025

Add ESGF links and more simulations to v1 data

@forsyth2 forsyth2 self-assigned this Jul 11, 2025
@forsyth2 forsyth2 marked this pull request as ready for review July 11, 2025 20:08
@forsyth2 (Collaborator Author)

@chengzhuzhang This is ready for review. I added the ESGF links that had data available. Web rendering can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60/html/v1/WaterCycle/simulation_data/simulation_table.html

@forsyth2 forsyth2 requested a review from chengzhuzhang July 11, 2025 20:09
@forsyth2 forsyth2 mentioned this pull request Jul 11, 2025
@forsyth2 forsyth2 changed the title Add ESGF links for v1 data Add ESGF links and more simulations to v1 data Jul 11, 2025
@forsyth2 forsyth2 (Collaborator Author) left a comment

@chengzhuzhang I added df0cfdb to begin the work of adding the large ensemble, but there's still a bit more to do on that, as described in this self-review.

Results from this commit can be seen at https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html.

@@ -0,0 +1,18 @@
# This will be a problem if these simulations are ever removed from the publication archives!
for i in $(seq 1 20); do
hsi ln -s /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens$i
forsyth2 (Collaborator Author):

HSI/HPSS appends an @ to the end of its symlinks, but that may just be a visual indicator. In any case, HPSS paths and data sizes aren't being displayed on https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try2/html/v1/WaterCycle/simulation_data/simulation_table.html

forsyth2 (Collaborator Author):

Some of the other data sets are showing a size of 0, but these don't show a size at all, which makes me think the path isn't being found.

That said, they do show up in my output logs:

1	/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens11
-----------------------
0	total 512-byte blocks, 0 Files (0 bytes)

So, it seems to read it as an empty path. I wonder if symlinks show zero size?

This one shows up as 0 in the table:

341850452	2	/home/projects/e3sm/www/WaterCycle/E3SMv1/HR/cori-haswell.20190513.F2010LRtunedHR.plus4K.noCNT.ne30_oECv3/
-----------------------
341850452	total 512-byte blocks, 2 Files (175,027,431,424 bytes)

So, it must be that 175x10^9 bytes rounds down to 0 TB (0.175x10^12 bytes). Indeed, this 113x10^12-byte run shows up as 113:

221651622324	820	/home/projects/e3sm/www/WaterCycle/E3SMv1/HR/20211021-maint-1.0-tro.A_WCYCLSSP585_CMIP6_HR.ne120_oRRS18v3_ICG.unc12-3rd-attempt/
-----------------------
221651622324	total 512-byte blocks, 820 Files (113,485,630,629,888 bytes)
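The rounding described above can be reproduced with a quick sketch (a hypothetical helper, assuming the table code converts bytes to whole terabytes with integer division):

```python
def bytes_to_tb(num_bytes: int) -> int:
    # Integer terabytes: anything under 10^12 bytes rounds down to 0.
    return num_bytes // 10**12

# The HR run above: 175,027,431,424 bytes displays as 0 TB.
assert bytes_to_tb(175_027_431_424) == 0
# The 113x10^12-byte run displays as 113.
assert bytes_to_tb(113_485_630_629_888) == 113
```

If this is the cause, formatting small data sets in GB (or as a float) would make the 0-TB rows readable.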

Collaborator:

@forsyth2 could you double-check the file size from /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens$i? Hopefully there was no corruption during the zstash archive or transfer.

forsyth2 (Collaborator Author):

@chengzhuzhang It's definitely an issue with the symlinks; I'm discussing with NERSC support. The original paths are fine, e.g.:

hsi du /home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1
# 49970007900	95	/home/projects/e3sm/www/publication-archives/pub_archive_E3SM_1_0_LE_historical_ens1/
# -----------------------
# 49970007900	total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)
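For comparing sizes programmatically, the byte count can be pulled out of the `hsi du` summary line (a sketch; the regex assumes the output format shown above):

```python
import re

def parse_hsi_du_bytes(summary: str) -> int:
    """Extract the byte count from an `hsi du` summary line, e.g.
    '49970007900  total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)'."""
    match = re.search(r"\(([\d,]+) bytes\)", summary)
    if match is None:
        raise ValueError(f"no byte count found in: {summary!r}")
    return int(match.group(1).replace(",", ""))

line = "49970007900  total 512-byte blocks, 95 Files (25,584,644,044,800 bytes)"
assert parse_hsi_du_bytes(line) == 25_584_644_044_800
```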

v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H1_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 1, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H2_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 2, none, ,
v1, WaterCycle, LR, DAMIP, 20190404.DECKv1b_H3_hist-GHG.ne30_oEC.edison, edison, , damip_hist-GHG, 3, none, ,
v1, WaterCycle, LR, LargeEnsemble, LE_historical_ens1, , , historical-large-ensemble, 1, none, ,
forsyth2 (Collaborator Author):

Is the large ensemble data available on ESGF? If so, what's the experiment name? I assume it's not historical-large-ensemble.

And actually, on that note, some of the other v1 data sets may be missing ESGF links simply because I guessed the experiment name wrong. (I'm not seeing a way to determine the experiment, or the ensemble number for that matter, from https://e3sm.atlassian.net/wiki/spaces/ED/pages/4495441922/V1+Simulation+backfill+WIP)

Collaborator:

Yes, the v1 large ensemble data are available on ESGF in CMIP format. The experiment and ensemble names can be found here: https://github.com/E3SM-Project/datasm/blob/master/datasm/resources/v1_LE_dataset_spec.yaml. @TonyB9000 I think you documented the mapping from the LE native ensemble index to the CMIP ensemble (e.g. r1i2p2f1), but I forget whether that was for v1 or v2. Could you help check?

Collaborator:

Will do.

Collaborator:

E3SM LE_archive_refactor.xlsb.xlsx

I think this was v1, since if it were v2 I would have had to distinguish them, but nothing in the naming indicated v1 or v2.

I'll keep poking around.

Collaborator:

The directory "/p/user_pub/e3sm/archive/External/" holds 5 related subdirectories:

E3SMv1_LE
E3SMv1_LE_ext
E3SMv1_LE_ssp370
E3SMv2_LE
E3SMv2_LE_ssp370

The E3SMv2_LE directory has a file I created called "Arch_Translator_E3SMv2_LE", which holds:

Ensemble,Archive,Branch_time_in_parent
ens6,v2.LR.historical_0111,40150.0
ens7,v2.LR.historical_0121,43800.0
ens8,v2.LR.historical_0131,47450.0
ens9,v2.LR.historical_0141,51100.0
ens10,v2.LR.historical_0161,58400.0
ens11,v2.LR.historical_0171,62050.0
ens12,v2.LR.historical_0181,65700.0
ens13,v2.LR.historical_0191,69350.0
ens14,v2.LR.historical_0211,76650.0
ens15,v2.LR.historical_0221,80300.0
ens16,v2.LR.historical_0231,83950.0
ens17,v2.LR.historical_0241,87600.0
ens18,v2.LR.historical_0261,94900.0
ens19,v2.LR.historical_0271,98550.0
ens20,v2.LR.historical_0281,102200.0
ens21,v2.LR.historical_0291,105850.0

(Ensembles 1-5 are missing because they were created independently of the LE, in the v2 historical.)

Likewise, E3SMv2_LE_ssp370/ holds a file named "Arch_Translator_E3SMv2_LE_ssp370", which contains:

Ensemble,Archive,Branch_time_in_parent
ens1,v2.LR.SSP370_0101,36500.0
ens6,v2.LR.SSP370_0111,40150.0
ens7,v2.LR.SSP370_0121,43800.0
ens8,v2.LR.SSP370_0131,47450.0
ens9,v2.LR.SSP370_0141,51100.0
ens2,v2.LR.SSP370_0151,54750.0
ens10,v2.LR.SSP370_0161,58400.0
ens11,v2.LR.SSP370_0171,62050.0
ens12,v2.LR.SSP370_0181,65700.0
ens13,v2.LR.SSP370_0191,69350.0
ens3,v2.LR.SSP370_0201,73000.0
ens14,v2.LR.SSP370_0211,76650.0
ens15,v2.LR.SSP370_0221,80300.0
ens16,v2.LR.SSP370_0231,83950.0
ens17,v2.LR.SSP370_0241,87600.0
ens4,v2.LR.SSP370_0251,91250.0
ens18,v2.LR.SSP370_0261,94900.0
ens19,v2.LR.SSP370_0271,98550.0
ens20,v2.LR.SSP370_0281,102200.0
ens21,v2.LR.SSP370_0291,105850.0
ens5,v2.LR.SSP370_0301,109500.0

I don't know how much that helps. Special functions were written that translate a given CMIP6 dataset_id to its corresponding E3SM "native" dataset_id. But for those functions to work (parent_native_dsid.sh, etc.), one must supply the alternate "Archive_Map" for the v1 or v2 LE, as these are not part of the E3SM "dataset_spec.yaml".

We can probably generate a "cmip-case" to "native-case" mapping file. Might take a day or so.
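For what it's worth, such a mapping could be bootstrapped from the Arch_Translator CSVs above (a sketch with a hypothetical helper name; only the three-column format is taken from the files shown):

```python
import csv
import io

# Two sample rows from Arch_Translator_E3SMv2_LE, quoted above.
ARCH_TRANSLATOR = """\
Ensemble,Archive,Branch_time_in_parent
ens6,v2.LR.historical_0111,40150.0
ens7,v2.LR.historical_0121,43800.0
"""

def ensemble_to_archive(text: str) -> dict:
    """Map each native ensemble index (e.g. 'ens6') to its archive case."""
    return {row["Ensemble"]: row["Archive"]
            for row in csv.DictReader(io.StringIO(text))}

mapping = ensemble_to_archive(ARCH_TRANSLATOR)
assert mapping["ens6"] == "v2.LR.historical_0111"
```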

@TonyB9000 TonyB9000 (Collaborator) commented Jul 15, 2025

The historical cases and ssp370 cases are independent.

Walking the tree for "Project: E3SM" a bit further, you will see a "cmip_case", and yes, there is a 1-to-1 mapping between each native "ens#" and the corresponding "Project: CMIP" case, as in

"(native) ens#" corresponds to "(CMIP6) r#" of the variant label ("realization index").

    E3SM:
        '1_0_LE':
            historical:
                start: 1850
                end: 2014
                ens:
                    - ens1
                    - ens2
                    - ens3
                    - ens4
                    - ens5
                    - ens6
                    - ens7
                    - ens8
                    - ens9
                    - ens10
                    - ens11
                    - ens12
                    - ens13
                    - ens14
                    - ens15
                    - ens16
                    - ens17
                    - ens18
                    - ens19
                    - ens20
                except:
                    - TREFMNAV
                    - TREFMXAV
                campaign: DECK-v1
                science_driver: Water Cycle
                cmip_case: CMIP6.CMIP.UCSB.E3SM-1-0.historical

If you put these lines into your (acme1) ~/.bashrc file:

export DSM_GETPATH=/p/user_pub/e3sm/staging/Relocation/.dsm_get_root_path.sh
alias list_e3sm="python /p/user_pub/e3sm/staging/tools/list_e3sm_dsids.py"
alias list_cmip="python /p/user_pub/e3sm/staging/tools/list_cmip6_dsids.py"

(and issue "source ~/.bashrc")

And then

1. git clone https://github.com/E3SM-Project/datasm.git
2. cd datasm
3. conda env create -n <env_name> -f conda-env/prod.yml
4. conda activate <env_name>
5. pip install .

Then your environment will have "datasm/util" and its functions available to any python, via "import datasm.util" or "from datasm.util import (selected functions)".

You can issue list_e3sm -d <path_to_the_dataset_spec> and generate ALL E3SM dataset_ids for that dataset_spec.

Likewise, use list_cmip -d <path_to_the_dataset_spec> to generate all corresponding CMIP6 dataset_ids.

These utilities will "walk" the respective YAML trees to express every branch. If no "-d dataset_spec" is given, the default dataset_spec.yaml (staging/resource/dataset_spec.yaml) is used.

Other than this, I'm not quite sure what you need. Perhaps I can generate stuff for you, if I understand what you are looking for.
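The "walk the YAML tree" behavior of list_e3sm/list_cmip can be sketched as follows (a toy spec mirroring the dataset_spec.yaml fragment above, not the actual datasm implementation):

```python
# Toy spec shaped like the dataset_spec.yaml fragment above.
spec = {
    "E3SM": {
        "1_0_LE": {
            "historical": {"start": 1850, "end": 2014,
                           "ens": ["ens1", "ens2", "ens3"]},
        },
    },
}

def list_dataset_ids(spec: dict) -> list:
    """Emit one dataset_id per (project, model, experiment, ensemble) branch."""
    ids = []
    for project, models in spec.items():
        for model, experiments in models.items():
            for experiment, info in experiments.items():
                for ens in info.get("ens", []):
                    ids.append(f"{project}.{model}.{experiment}.{ens}")
    return ids

assert list_dataset_ids(spec) == [
    "E3SM.1_0_LE.historical.ens1",
    "E3SM.1_0_LE.historical.ens2",
    "E3SM.1_0_LE.historical.ens3",
]
```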

forsyth2 (Collaborator Author):

@TonyB9000 I'm just trying to determine the URL to link to. That requires knowing the query parameter values. But upon further inspection, it appears ESGF links might not be available for the large ensemble, in which case it's a moot point.
[Screenshot: ESGF search page. Notice that none of the available experiment IDs suggest the large ensemble.]

forsyth2 (Collaborator Author):

Even clicking "historical", it's just the 5 ensemble members of the regular historical. Again, no indicator of the large ensemble.

[Screenshot: ESGF search page with "historical" selected.]

Collaborator:

The v1 LE is published under the UCSB institution ID; if you leave out the Institution ID filter, the large ensemble should pop up.

Collaborator:

True. But for v2 LE the project listed is "E3SM-Project" (21 ensembles). The CMIP6 datasets are not really distinguished as "LE" except that the variant labels range from r6 to r21 (16 ensembles). The native data is distinguished by Model = "2_0_LE". Likewise, the v1_LE native data has Model = "1_0_LE" (But native data is no longer available via ESGF/Metagrid.)

done

# Symlink last remaining large simulation
# This will be a problem if ndk ever deletes the source!
forsyth2 (Collaborator Author):

@chengzhuzhang I meant to include this in the self-review I just posted. The symlinks are fine as long as we are guaranteed that people don't delete the source directories like /home/projects/e3sm/www/publication-archives/ or /home/n/ndk/2019/theta.20190910.branch_noCNT.n825def.unc06.A_WCYCL1950S_CMIP6_HR.ne120_oRRS18v3_ICG. Is that something we can be sure of?

Collaborator:

I think so. Tagging directory owners @TonyB9000 and @ndkeen: please make sure not to delete the above directories.

@forsyth2 forsyth2 (Collaborator Author) left a comment

@chengzhuzhang @TonyB9000 Ok I've added the large ensemble & the existing ESGF links. See https://portal.nersc.gov/cfs/e3sm/forsyth/data_docs_60_try9/html/v1/WaterCycle/simulation_data/simulation_table.html for a rendered version of the web page. This is ready for final review.

[Screenshot: v1 data page]

@TonyB9000 I've noted symlinked HPSS paths with (symlink) ...hpss_path...; is that going to interfere with any automated data retrieval you do from these pages?

@chengzhuzhang (Collaborator)

@forsyth2 thanks for adding the v1 LE and the ESGF links. One note: for the simulation overview page, could you also add:

  1. the v1 LE
  2. the overview paper describing the v1 LE: Stevenson et al. 2023, https://doi.org/10.1029/2023MS003653

Thanks!

@forsyth2 (Collaborator Author)

@TonyB9000 (Collaborator)

@forsyth2 @chengzhuzhang
I've noted symlinked HPSS paths with (symlink) ...hpss_path...; is that going to interfere with any automated data retrieval you do from these pages?

Yes, it most certainly will. A column labeled "HPSS Path" should not be polluted with non-functional commentary. People need to understand that we use computers to automate. As nice as it is to have human-friendly material, such should be secondary to functional considerations.

Personally, I would have the default date-timestamp on ALL log-files be 8+ HEX chars (like "D85A33B2", representing Epoch-seconds). Very unfriendly to look at? Then pass it through a "prettifier" that converts the log entry to "2025-07-15 09:42:30", or if you like, "The Fifteenth Day of Our Lord, July 2025 AD, at the 9th hour, 42nd minute, and 30th second of the morning in the Pacific Standard Timezone".

Instead, I will need to munge code to toss out everything in the returned HPSS-Path that occurs before the first "/". For now, at least.
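The munging described above amounts to dropping everything before the first "/" in the cell (a sketch, not the actual retrieval code):

```python
def clean_hpss_path(cell: str) -> str:
    """Drop any annotation (e.g. '(symlink) ') before the first '/'."""
    idx = cell.find("/")
    return cell[idx:] if idx != -1 else cell

assert clean_hpss_path(
    "(symlink) /home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
) == "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1"
```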

@forsyth2 (Collaborator Author)

People need to understand that we use computers to automate. As nice as it is to have human-friendly material, such should be secondary to functional considerations.

This page is for humans, though. It almost seems like we should have some sort of output file meant for a computer to read, rather than having a program parse the information from HTML. As I noted in a previous email:

I think perhaps the most straightforward thing to do here is to modify "generate_tables" in https://github.com/E3SM-Project/e3sm_data_docs/blob/main/utils/generate_tables.py#L227 to produce not only the rst table but also an equivalent csv (or, better yet, construct the table from csv per #30). Then it's exactly the data you need, in the right format.

That is, I believe the fundamental issue here is that we're relying on HTML serving both computers & humans, when we should just be outputting computer-readable material elsewhere.
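The single-source idea could look roughly like this: render both the machine-readable CSV and the human-facing rst table from the same rows (a sketch, not the actual generate_tables.py code; the column names are illustrative):

```python
import csv
import io

HEADER = ["simulation", "machine", "hpss_path", "size_tb"]

def to_csv(rows: list) -> str:
    """Machine-readable output: plain CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(HEADER)
    writer.writerows(rows)
    return buf.getvalue()

def to_rst(rows: list) -> str:
    """Human-facing output: an rst list-table built from the same rows."""
    lines = [".. list-table::", "   :header-rows: 1", ""]
    for row in [HEADER] + [[str(cell) for cell in row] for row in rows]:
        lines.append("   * - " + "\n     - ".join(row))
    return "\n".join(lines) + "\n"

rows = [["LE_historical_ens1", "", "/home/projects/e3sm/www/WaterCycle/E3SMv1/LR/LE_historical_ens1", "25"]]
assert to_csv(rows).splitlines()[0] == "simulation,machine,hpss_path,size_tb"
```

With both files generated from one row list, the HTML and the machine-readable copy can never drift apart.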

@forsyth2 (Collaborator Author)

@TonyB9000 if you provide me with an exact list of data you need from these tables, I should be able to easily produce that in a machine readable file.

@forsyth2 (Collaborator Author)

if you provide me with an exact list of data you need from these tables, I should be able to easily produce that in a machine readable file.

That would need to be part of a separate PR, though, as the work is distinct from adding the v1 data.
In the meantime, do you need all the columns clean, or can I add the "(symlink)" note to, say, the simulation name column?

@TonyB9000 (Collaborator)

@forsyth2

we're relying on HTML serving both computers & humans

Indeed. In fact, to avoid inconsistencies, the focus should be to produce the "machine-readable" version of materials, and then use that as the primary source for HTML creation and human-readable material, augmented with commentary, etc.

Machine ==> Human: Easy
Human ==> Machine: HARD.

@forsyth2 forsyth2 mentioned this pull request Jul 15, 2025
@forsyth2 (Collaborator Author)

@TonyB9000 Great, I prototyped a solution at #61. Can you please review #61 (review)? If you approve of that, I think I can go ahead and merge both PRs.

@TonyB9000 (Collaborator)

@forsyth2

do you need all the columns clean

I should clarify. At runtime, I consult my own "NERSC_Archive_Locator" file, whose entries are (e.g.):

LR:AMIP,v2.LR.amip_0101,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101
LR:AMIP,v2.LR.amip_0201,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0201
LR:AMIP,v2.LR.amip_0301,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0301
LR:AMIP,v2.LR.amip_0101_bonus,2,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101_bonus
LR:RFMIP,v2.LR.piClim-control,1,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control
LR:RFMIP,v2.LR.piClim-histall_0021,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0021
LR:RFMIP,v2.LR.piClim-histall_0031,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0031
LR:RFMIP,v2.LR.piClim-histall_0041,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histall_0041
LR:RFMIP,v2.LR.piClim-histaer_0021,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0021
LR:RFMIP,v2.LR.piClim-histaer_0031,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0031
LR:RFMIP,v2.LR.piClim-histaer_0041,3,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-histaer_0041
LR:Other,v2_ndgclim_t6h_1850aer,0,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2_ndgclim_t6h_1850aer
LR:Other,v2_ndgclim_t6h_2010aer,0,na,na,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2_ndgclim_t6h_2010aer
NARRM:DECK,v2.NARRM.piControl,80,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/NARRM/v2.NARRM.piControl
NARRM:DECK,v2.NARRM.abrupt-4xCO2_0101,24,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/NARRM/v2.NARRM.abrupt-4xCO2_0101

This was created by MANUALLY scraping the HTML data. Note that the hyperlinks are removed; I don't use the first column.

At runtime (thanks to having created a local "Archive_Map": paths to archives on Chrysalis AND zstash file-extraction patterns), if I don't have the data in the warehouse BUT it is listed in the local Archive_Map, I take the "basename" of the archive path (the case_id, like "DECK,v2.NARRM.piControl") and look it up in the NERSC_Archive_Locator (field 2). Where a match is found, I return fields 3 (Volume) and 6 (NERSC HPSS Archive Path).

I then (hope to) use "zstash --check" to pull over the archive in question.

Since I (presently) create the NERSC Archive_Locator manually, I simply edit out extraneous material.
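The lookup described above is a keyed scan of the locator file: match on field 2, return fields 3 and 6 (a sketch using two of the sample rows; the function name is hypothetical):

```python
# Two sample rows from the NERSC_Archive_Locator quoted above.
LOCATOR = """\
LR:AMIP,v2.LR.amip_0101,2,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.amip_0101
LR:RFMIP,v2.LR.piClim-control,1,CMIP,Native,/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control
"""

def locate(case_id: str):
    """Return (volume, hpss_path) for a case_id, or None if absent."""
    for line in LOCATOR.strip().splitlines():
        fields = line.split(",")
        if fields[1] == case_id:
            return fields[2], fields[5]  # field 3 (Volume), field 6 (HPSS path)
    return None

assert locate("v2.LR.piClim-control") == (
    "1", "/home/projects/e3sm/www/WaterCycle/E3SMv2/LR/v2.LR.piClim-control")
```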

@forsyth2 (Collaborator Author)

I'm a little confused. If the current process is manual, then what's the problem with having "(symlink) " in the HPSS path cell?

In any case, #61 should pave the way to full automation.

@TonyB9000 TonyB9000 (Collaborator) commented Jul 15, 2025

@forsyth2 I guess "semi-manual", as I do use tools to strip formatting from the HTML copy. But yes, it is only a minor inconvenience. These things do add up (mapfile/region-file selection, user_metadata updates, etc.), so I am simply venting my frustration with the system overall. Each of these little (manual) things is:

  1. Something to forget to do until the last minute, when things fail
  2. An opportunity to mis-copy or otherwise screw up configuration

Hence, forcing automation not only eases the manual burden, it isolates decisions to a "fix-it-once-and-forget-it" regime of operation.

@forsyth2 forsyth2 merged commit 5c5dfc3 into main Jul 15, 2025
1 check passed
@forsyth2 forsyth2 deleted the v1-data-esgf branch July 15, 2025 23:48